{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Training Pipeline\n",
    "[run_training_pipeline.ipynb](https://github.com/shibing624/MedicalGPT/blob/main/run_training_pipeline.ipynb)    | [Open In Colab](https://colab.research.google.com/github/shibing624/MedicalGPT/blob/main/run_training_pipeline.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Stage 1: Continue Pretraining\n",
    "\n",
    "第一阶段：PT(Continue PreTraining)增量预训练，在海量领域文本数据上二次预训练GPT模型，以适配领域数据分布\n",
    "\n",
    "注意：\n",
    "1. 此阶段是可选的，如果你没有海量领域文本，可以跳过此阶段，直接进行SFT阶段的有监督微调\n",
    "2. 我实验发现：做领域知识注入，SFT比PT更高效，也可以跳过PT阶段\n",
    "\n",
    "| Stage 1: Continue Pretraining   |  [pretraining.py](https://github.com/shibing624/MedicalGPT/blob/main/pretraining.py) | [run_pt.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_pt.sh)    |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 说明：\n",
    "以下 notebook/colab 代码为了快速验证训练代码可用，我们使用了小size的生成模型和小样本数据集，实际使用时，需要使用更大的模型和数据集，以获得更好的效果。\n",
    "\n",
    "1. 生成模型：使用的是Qwen/Qwen2.5-0.5B\n",
    "2. 数据集：PT阶段使用的是中文天龙八部小说部分文本和英文书籍部分文本，位于`data/pretrain`文件夹"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 配置运行环境\n",
    "\n",
    "本地执行可注释以下配置环境的命令，colab执行要打开注释，用于配置环境\n",
    "\n",
    "colab建议使用T4 GPU训练，设置方式：`代码执行程序 -> 更改运行时类型 -> 运行时类型：Python3，硬件加速器：GPU，GPU类型：T4 -> 保存`\n",
    "\n",
    "步骤：\n",
    "1. 下载最新代码到本地\n",
    "2. 安装依赖包\n",
    "\n",
    "依赖包如下，保证最新版本：\n",
    "\n",
    "```\n",
    "loguru\n",
    "transformers\n",
    "sentencepiece\n",
    "datasets\n",
    "tensorboard\n",
    "tqdm\n",
    "peft\n",
    "trl\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone --depth 1 https://github.com/shibing624/MedicalGPT.git\n",
    "%cd MedicalGPT\n",
    "%ls\n",
    "!pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage1 咱们开始吧\n",
    "\n",
    "训练步骤如下：\n",
    "\n",
    "1. 确认训练集\n",
    "2. 执行训练脚本\n",
    "\n",
    "训练脚本的执行逻辑如下：\n",
    "1. 导入依赖包\n",
    "2. 设置参数\n",
    "3. 定义各函数并加载训练集\n",
    "4. 加载模型和tokenizer\n",
    "5. 开始训练并评估\n",
    "6. 查看训练结果\n",
    "\n",
    "**以下参数可以根据你的GPU实际情况修改，当前参数是根据Colab的T4单卡GPU（16GB显存）配置的**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%ls ./data/pretrain/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python pretraining.py \\\n",
    "    --model_name_or_path Qwen/Qwen2.5-0.5B \\\n",
    "    --train_file_dir ./data/pretrain \\\n",
    "    --validation_file_dir ./data/pretrain \\\n",
    "    --per_device_train_batch_size 3 \\\n",
    "    --per_device_eval_batch_size 3 \\\n",
    "    --do_train \\\n",
    "    --do_eval \\\n",
    "    --use_peft True \\\n",
    "    --seed 42 \\\n",
    "    --bf16 \\\n",
    "    --max_train_samples 20000 \\\n",
    "    --max_eval_samples 10 \\\n",
    "    --num_train_epochs 1 \\\n",
    "    --learning_rate 2e-4 \\\n",
    "    --warmup_ratio 0.05 \\\n",
    "    --weight_decay 0.01 \\\n",
    "    --logging_strategy steps \\\n",
    "    --logging_steps 10 \\\n",
    "    --eval_steps 50 \\\n",
    "    --eval_strategy steps \\\n",
    "    --save_steps 500 \\\n",
    "    --save_strategy steps \\\n",
    "    --save_total_limit 3 \\\n",
    "    --gradient_accumulation_steps 1 \\\n",
    "    --preprocessing_num_workers 1 \\\n",
    "    --block_size 128 \\\n",
    "    --group_by_length True \\\n",
    "    --output_dir outputs-pt-v1 \\\n",
    "    --overwrite_output_dir \\\n",
    "    --ddp_timeout 30000 \\\n",
    "    --logging_first_step True \\\n",
    "    --target_modules all \\\n",
    "    --lora_rank 8 \\\n",
    "    --lora_alpha 16 \\\n",
    "    --lora_dropout 0.05 \\\n",
    "    --torch_dtype bfloat16 \\\n",
    "    --device_map auto \\\n",
    "    --report_to tensorboard \\\n",
    "    --ddp_find_unused_parameters False \\\n",
    "    --gradient_checkpointing True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%ls -lh outputs-pt-v1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型训练结果：\n",
    "- 使用lora训练模型，则保存的lora权重是`adapter_model.safetensors`, lora配置文件是`adapter_config.json`，合并到base model的方法见`merge_peft_adapter.py`\n",
    "- 日志保存在`output_dir/runs`目录下，可以使用tensorboard查看，启动tensorboard方式如下：`tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "lora模型权重合并到base model，合并后的模型保存在`--output_dir`目录下，合并方法如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python merge_peft_adapter.py \\\n",
    "    --base_model Qwen/Qwen2.5-0.5B --lora_model outputs-pt-v1 --output_dir merged-pt/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls -lh merged-pt/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%cat merged-pt/config.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Stage1 增量预训练完成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-06-15T13:56:17.081153Z",
     "start_time": "2023-06-15T13:56:17.032821Z"
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Stage 2: Supervised FineTuning\n",
    "\n",
    "第二阶段：SFT(Supervised Fine-tuning)有监督微调，构造指令微调数据集，在预训练模型基础上做指令精调，以对齐指令意图，并注入领域知识\n",
    "\n",
    "| Stage 2: Supervised Fine-tuning | [supervised_finetuning.py](https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py) | [run_sft.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_sft.sh)  |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### 说明：\n",
    "以下 notebook/colab 代码为了快速验证训练代码可用，我们使用了小size的生成模型和小样本数据集，实际使用时，需要使用更大的模型和数据集，以获得更好的效果。\n",
    "\n",
    "1. 生成模型：使用的是Qwen/Qwen2.5-0.5B 或者 Stage1得到的预训练模型\n",
    "2. 数据集：SFT阶段使用的是使用的是Belle的1千条抽样数据，位于`data/finetune`文件夹"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Stage2 咱们开始吧\n",
    "\n",
    "训练步骤如下：\n",
    "\n",
    "1. 确认训练集\n",
    "2. 执行训练脚本\n",
    "\n",
    "训练脚本的执行逻辑如下：\n",
    "1. 导入依赖包\n",
    "2. 设置参数\n",
    "3. 定义各函数并加载训练集\n",
    "4. 加载模型和tokenizer\n",
    "5. 开始训练并评估\n",
    "6. 查看训练结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-06-15T13:58:38.966506Z",
     "start_time": "2023-06-15T13:58:38.778132Z"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls ./data/finetune"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python supervised_finetuning.py \\\n",
    "    --model_name_or_path merged-pt \\\n",
    "    --train_file_dir ./data/finetune \\\n",
    "    --validation_file_dir ./data/finetune \\\n",
    "    --per_device_train_batch_size 4 \\\n",
    "    --per_device_eval_batch_size 4 \\\n",
    "    --do_train \\\n",
    "    --do_eval \\\n",
    "    --use_peft True \\\n",
    "    --bf16 \\\n",
    "    --max_train_samples 1000 \\\n",
    "    --max_eval_samples 10 \\\n",
    "    --num_train_epochs 1 \\\n",
    "    --learning_rate 2e-5 \\\n",
    "    --warmup_ratio 0.05 \\\n",
    "    --weight_decay 0.05 \\\n",
    "    --logging_strategy steps \\\n",
    "    --logging_steps 10 \\\n",
    "    --eval_steps 50 \\\n",
    "    --eval_strategy steps \\\n",
    "    --save_steps 500 \\\n",
    "    --save_strategy steps \\\n",
    "    --save_total_limit 3 \\\n",
    "    --gradient_accumulation_steps 1 \\\n",
    "    --preprocessing_num_workers 1 \\\n",
    "    --output_dir outputs-sft-v1 \\\n",
    "    --overwrite_output_dir \\\n",
    "    --ddp_timeout 30000 \\\n",
    "    --logging_first_step True \\\n",
    "    --target_modules all \\\n",
    "    --lora_rank 8 \\\n",
    "    --lora_alpha 16 \\\n",
    "    --lora_dropout 0.05 \\\n",
    "    --torch_dtype bfloat16 \\\n",
    "    --device_map auto \\\n",
    "    --report_to tensorboard \\\n",
    "    --ddp_find_unused_parameters False \\\n",
    "    --gradient_checkpointing True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls -lh outputs-sft-v1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "模型训练结果：\n",
    "- 使用lora训练模型，则保存的lora权重是`adapter_model.safetensors`, lora配置文件是`adapter_config.json`，合并到base model的方法见`merge_peft_adapter.py`\n",
    "- 日志保存在`output_dir/runs`目录下，可以使用tensorboard查看，启动tensorboard方式如下：`tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "lora模型权重合并到base model，合并后的模型保存在`--output_dir`目录下，合并方法如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python merge_peft_adapter.py \\\n",
    "    --base_model merged-pt --lora_model outputs-sft-v1 --output_dir merged-sft/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls -lh merged-sft/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%cat merged-sft/config.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Stage2 SFT训练完成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-06-15T14:07:40.752635Z",
     "start_time": "2023-06-15T14:07:40.731186Z"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Stage 3: Reward Modeling\n",
    "\n",
    "第三阶段：RM(Reward Model)奖励模型建模，构造人类偏好排序数据集，训练奖励模型，用来对齐人类偏好，主要是\"HHH\"原则，具体是\"helpful, honest, harmless\"\n",
    "\n",
    "| Stage 3: Reward Modeling        |  [reward_modeling.py](https://github.com/shibing624/MedicalGPT/blob/main/reward_modeling.py) | [run_rm.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_rm.sh)    |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### 说明：\n",
    "以下 notebook/colab 代码为了快速验证训练代码可用，我们使用了小size的生成模型和小样本数据集，实际使用时，需要使用更大的模型和数据集，以获得更好的效果。\n",
    "\n",
    "1. 生成模型：使用的是Qwen/Qwen2.5-0.5B 或者 Stage2得到的SFT模型\n",
    "2. 数据集：RM阶段使用的是医疗reward数据，抽样了500条，位于`data/reward`文件夹"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Stage3 咱们开始吧\n",
    "\n",
    "训练步骤如下：\n",
    "\n",
    "1. 确认训练集\n",
    "2. 执行训练脚本\n",
    "\n",
    "训练脚本的执行逻辑如下：\n",
    "1. 导入依赖包\n",
    "2. 设置参数\n",
    "3. 定义各函数并加载训练集\n",
    "4. 加载模型和tokenizer\n",
    "5. 开始训练并评估\n",
    "6. 查看训练结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls ./data/reward/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python reward_modeling.py \\\n",
    "    --model_name_or_path merged-sft \\\n",
    "    --train_file_dir ./data/reward \\\n",
    "    --validation_file_dir ./data/reward \\\n",
    "    --per_device_train_batch_size 1 \\\n",
    "    --per_device_eval_batch_size 1 \\\n",
    "    --do_train \\\n",
    "    --use_peft True \\\n",
    "    --seed 42 \\\n",
    "    --max_train_samples 1000 \\\n",
    "    --max_eval_samples 10 \\\n",
    "    --num_train_epochs 1 \\\n",
    "    --learning_rate 2e-5 \\\n",
    "    --warmup_ratio 0.05 \\\n",
    "    --weight_decay 0.001 \\\n",
    "    --logging_strategy steps \\\n",
    "    --logging_steps 10 \\\n",
    "    --eval_steps 50 \\\n",
    "    --eval_strategy steps \\\n",
    "    --save_steps 500 \\\n",
    "    --save_strategy steps \\\n",
    "    --save_total_limit 3 \\\n",
    "    --max_source_length 256 \\\n",
    "    --max_target_length 256 \\\n",
    "    --output_dir outputs-rm-v1 \\\n",
    "    --overwrite_output_dir \\\n",
    "    --ddp_timeout 30000 \\\n",
    "    --logging_first_step True \\\n",
    "    --target_modules all \\\n",
    "    --lora_rank 8 \\\n",
    "    --lora_alpha 16 \\\n",
    "    --lora_dropout 0.05 \\\n",
    "    --torch_dtype float16 \\\n",
    "    --fp16 \\\n",
    "    --device_map auto \\\n",
    "    --report_to tensorboard \\\n",
    "    --ddp_find_unused_parameters False \\\n",
    "    --remove_unused_columns False \\\n",
    "    --gradient_checkpointing True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls -lh outputs-rm-v1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "模型训练结果：\n",
    "- 使用lora训练模型，则保存的lora权重是`adapter_model.safetensors`, lora配置文件是`adapter_config.json`，合并到base model的方法见`merge_peft_adapter.py`\n",
    "- 日志保存在`output_dir/runs`目录下，可以使用tensorboard查看，启动tensorboard方式如下：`tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "lora模型权重合并到base model，合并后的模型保存在`--output_dir`目录下，合并方法如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python merge_peft_adapter.py \\\n",
    "    --base_model merged-sft --lora_model outputs-rm-v1 --output_dir merged-rm/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls -lh merged-rm/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%cat merged-rm/config.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Stage3 奖励建模第一次训练完成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-06-15T14:12:09.472414Z",
     "start_time": "2023-06-15T14:12:09.464881Z"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Stage 4: Reinforcement Learning Training\n",
    "\n",
    "第四阶段：RL(Reinforcement Learning)基于人类反馈的强化学习(RLHF)，用奖励模型来训练SFT模型，生成模型使用奖励或惩罚来更新其策略，以便生成更高质量、更符合人类偏好的文本\n",
    "\n",
    "| Stage 4: Reinforcement Learning |  [rl_training.py](https://github.com/shibing624/MedicalGPT/blob/main/rl_training.py) | [run_rl.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_rl.sh)    |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### 说明：\n",
    "以下 notebook/colab 代码为了快速验证训练代码可用，我们使用了小size的生成模型、奖励模型和小样本数据集，实际使用时，需要使用更大的模型和数据集，以获得更好的效果。\n",
    "\n",
    "1. 生成模型：使用的是Qwen/Qwen2.5-0.5B 或者 Stage2得到的SFT模型\n",
    "2. 奖励模型：使用的是`OpenAssistant/reward-model-deberta-v3-large-v2` 或者 Stage3得到的BERT类或者GPT类奖励模型\n",
    "3. 数据集：RL阶段的数据可以复用SFT的数据集，使用的是Belle的1千条抽样数据，位于`data/finetune`文件夹"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Stage4 咱们开始吧\n",
    "\n",
    "训练步骤如下：\n",
    "\n",
    "1. 确认训练集\n",
    "2. 执行训练脚本\n",
    "\n",
    "训练脚本的执行逻辑如下：\n",
    "1. 导入依赖包\n",
    "2. 设置参数\n",
    "3. 定义各函数并加载训练集\n",
    "4. 加载生成模型和tokenizer，加载奖励模型和其tokenizer\n",
    "5. 开始训练并评估\n",
    "6. 查看训练结果\n",
    "\n",
    "以下参数可以根据你的GPU实际情况修改，当前参数是根据Colab的T4单卡GPU（16GB显存）配置的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%ls ./data/finetune/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "! CUDA_VISIBLE_DEVICES=0 python ppo_training.py \\\n",
    "    --sft_model_path ./merged-sft \\\n",
    "    --reward_model_path ./merged-rm \\\n",
    "    --template_name qwen \\\n",
    "    --torch_dtype bfloat16 \\\n",
    "    --bf16 \\\n",
    "    --train_file_dir ./data/finetune \\\n",
    "    --validation_file_dir ./data/finetune \\\n",
    "    --max_source_length 256 \\\n",
    "    --response_length 1000 \\\n",
    "    --do_train \\\n",
    "    --save_steps 50 \\\n",
    "    --output_dir outputs-ppo-v1 \\\n",
    "    --num_train_epochs 3 \\\n",
    "    --report_to tensorboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": "%ls -lh outputs-ppo-v1"
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "模型训练结果：\n",
    "- use_peft=False,默认是使用全参训练，模型保存的就是`model-00001-of-00002.safetensors`等文件，配置文件是`config.json`\n",
    "- use_peft=True, 则使用lora训练模型，则保存的lora权重是`adapter_model.safetensors`, lora配置文件是`adapter_config.json`，合并到base model的方法见`merge_peft_adapter.py`\n",
    "- 日志保存在`output_dir/trl`目录下，可以使用tensorboard查看，启动tensorboard方式如下：`tensorboard --logdir output_dir/trl --host 0.0.0.0 --port 8009`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": "%ls -lh outputs-ppo-v1/"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": "%cat outputs-ppo-v1/config.json"
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Stage4 RL第一次训练完成。\n",
    "\n",
    "**至此一个完整的4阶段训练流程演示完成。**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "实际操作中Stage3和Stage4可以反复多次，直到RL得到的最后模型满足评估要求。\n",
    "\n",
    "RLHF过程可以把SFT模型当成一个初始化模型，RM模型当做指导老师，使用RL(PPO)调教SFT模型生成指导老师最满意的结果，如果小学老师满意了，我们就再训练一个中学老师，继续指导，中学老师满意了，就训练一个大学老师，这样不断迭代，使得生成模型的质量达到甚至超过人工撰写的天花板。\n",
    "\n",
    "RLHF训练不易，此项目提供给大家一种实现的方法和参考，希望抛砖引玉，共同促进中文开源LLM发展。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-06-26T12:34:29.658428Z",
     "start_time": "2023-06-26T12:34:29.620609Z"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-06-26T12:35:00.864463Z",
     "start_time": "2023-06-26T12:34:47.802087Z"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!python inference.py --base_model merged-ppo-v1\n",
    "# 或在shell中运行\n",
    "# !python inference.py --base_model merged-ppo-v1 --interactive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Input:介绍下南京\n",
    "Response:  南京市位于江苏省西南部，是全国首批历史文化名城、国家中心城市和自由贸易试验区。\n",
    "\n",
    "完。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "f34eed0bebedfc4b6ee51ced43d2c030fe3b92f13c149d072205ca200a67b1ec"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
